-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: fix TestVMPreemptionPolling data race #135312
Conversation
pkg/cmd/roachtest/test_runner.go
Outdated
@@ -2139,13 +2139,14 @@ func monitorForPreemptedVMs(ctx context.Context, t test.Test, c cluster.Cluster, | |||
if c.IsLocal() || !c.Spec().UseSpotVMs { | |||
return | |||
} | |||
interval := pollPreemptionInterval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to mention we might need a mutex on the previous PR. In general, when there are multiple hooks (for mocking), it might be best to stick them in a mutex-protected struct.
That aside, I'm a bit puzzled by the data race in [1]. The subtests are executed sequentially, which btw you'd still have a race, if t.Parallel()
is enabled. The write on L754 races with the read of the previous subtest, right? That would imply that the runCtx
passed into monitorForPreemptedVMs
wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.
[1] #135267
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I was going to but forgot to leave a comment saying something along the lines of "I'm not really sure how this fixes the data race because I'm not 100% what the data race is".
That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.
That was my initial thought too, but I commented out the first of the two tests and was still getting the data race. Even weirder is that when both tests are running, the data race always happens on the second. That would suggest it is reliant on both tests running but it isn't. To further add to that, commenting out the second test causes the first to start failing with a data race.
Not really sure what to make of that 😅 I also tried adding a defer leaktest.AfterTest
which didn't catch anything.
The subtests are executed sequentially, which btw you'd still have a race, if t.Parallel() is enabled
it might be best to stick them in a mutex-protected struct.
Yeah, I thought of that but figured it would be unfortunate to have to add a mutex for a unit test only hook. I can go that route though if we think it's safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We both missed an essential hint in the data race report–Goroutine 741 (finished)
. At the time when the data race report is printed [1], the goroutine running monitorForPreemptedVMs
has already exited. This confirms that the runCtx
is indeed cancelled.
We can also conclude that it's the write
of the second subtest which races with the read
of the first subtest. The sequence of (scheduled) events is roughly as follows. The goroutine of the first subtest exits before monitorForPreemptedVMs
exits; at this point, tsan
has recorded a read from the corresponding goroutine id. Next, the goroutine for the second subtest is executed; tsan
records the write. A data race is detected. Subsequently, monitorForPreemptedVMs
exits. The reporter iterates through the implicated goroutines, printing the stack traces, at which point Goroutine 741
already finished :)
Your fix works because the read in monitorForPreemptedVMs
must happen before the first subtest goroutine exits, and transitively it happens before the second subtest does the write. s.Run
essentially establishes the happens-before order; from runTest
,
monitorForPreemptedVMs(runCtx, t, c, l)
// This is the call to actually run the test.
s.Run(runCtx, t, c)
Thus, the only read in monitorForPreemptedVMs
now happens before s.Run
exits, and by extension runTest
.
Yeah, I thought of that but figured it would be unfortunate to have to add a mutex for a unit test only hook. I can go that route though if we think it's safer.
It might be a good idea since the reason the current fix works is rather non-trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, yes I think that explanation checks out.
but I commented out the first of the two tests and was still getting the data race.
I should've taken a closer look at the data race I was getting, it was due to the defer statement which would fail under race for the same reason, but explains why it was failing with only one subtest running.
It might be a good idea since the reason the current fix works is rather non-trivial.
Done.
f3dcca2
to
befb39d
Compare
pkg/cmd/roachtest/test_runner.go
Outdated
interval := pollPreemptionInterval.interval | ||
// If no interval is set, default to 5 minutes. | ||
if interval == 0 { | ||
interval = 5 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belongs in the init
function; otherwise, it's not exactly isomorphic to the previous version; e.g., interval
could be reset to 0 (by another thread).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, modulo a tiny init-related comment.
This change switches to pollPreemptionInterval to be a mutex protected struct instead, as multiple unit tests modify it and can lead to a data race without. Fixes: cockroachdb#135267 Epic: none Release note: none
befb39d
to
8e3466d
Compare
TFTRs! bors r=srosenberg, herkolategan |
This change switches to pollPreemptionInterval to be a mutex protected struct instead, as multiple unit tests modify it and can lead to a data race without.
Fixes: #135267
Epic: none
Release note: none